Prototype methods seek a minimal subset of samples that can 
serve as a distillation or condensed view of a data set. As the size 
of modern data sets grows, being able to present a domain specialist 
with a short list of "representative" samples chosen from the data 
set is of increasing interpretative value. While much recent statis- 
tical research has been focused on producing sparse-in-the-variables 
methods, this paper aims at achieving sparsity in the samples. 

We discuss a method for selecting prototypes in the classification 
setting (in which the samples fall into known discrete categories). Our 
method of focus is derived from three basic properties that we believe 
a good prototype set should satisfy. This intuition is translated into 
a set cover optimization problem, which we solve approximately using 
standard approaches. While prototype selection is usually viewed as 
purely a means toward building an efficient classifier, in this paper we 
emphasize the inherent value of having a set of prototypical elements. 
That said, by using the nearest-neighbor rule on the set of prototypes, 
we can of course discuss our method as a classifier as well. 

We demonstrate the interpretative value of producing prototypes 
on the well-known USPS ZIP code digits data set and show that 
as a classifier it performs reasonably well. We apply the method to 
a proteomics data set in which the samples are strings and therefore 
not naturally embedded in a vector space. Our method is compati- 
ble with any dissimilarity measure, making it amenable to situations 
in which using a non-Euclidean metric is desirable or even neces- 
sary. 
1. Introduction. Much of statistics is based on the notion that averaging 
over many elements of a data set is a good thing to do. In this paper, we 
take an opposite tack. In certain settings, selecting a small number of "rep- 
resentative" samples from a large data set may be of greater interpretative 
value than generating some "optimal" linear combination of all the elements 
of a data set. For domain specialists, examining a handful of representative 
examples of each class can be highly informative, especially when $n$ is large 
(since looking through all examples from the original data set could be 
overwhelming or even infeasible). Prototype methods aim to select a relatively 
small number of samples from a data set which, if well chosen, can serve as 
a summary of the original data set. In this paper, we motivate a particular 
method for selecting prototypes in the classification setting. The resulting 
method is very similar to Class Cover Catch Digraphs of Priebe et al. (2003). 
In fact, we have found many similar proposals across multiple fields, which 
we review later in this paper. What distinguishes this work from the rest is 
our interest in prototypes as a tool for better understanding a data set — that 
is, making it more easily "human-readable." The bulk of the previous liter- 
ature has been on prototype extraction specifically for building classifiers. 
We find it useful to discuss our method as a classifier to the extent that it 
permits quantifying its abilities. However, our primary objective is aiding 
domain specialists in making sense of their data sets. 

Much recent work in the statistics community has been devoted to the 
problem of interpretable classification through achieving sparsity in the vari- 
ables [Tibshirani et al. (2002), Zhu et al. (2004), Park and Hastie (2007), 
Friedman, Hastie and Tibshirani (2010)]. In this paper, our aim is inter- 
pretability through sparsity in the samples. Consider the US Postal Service's 
ZIP code data set, which consists of a training set of 7,291 grayscale (16 x 16 
pixel) images of handwritten digits 0-9 with associated labels indicating the 
intended digit. A typical "sparsity-in-the-variables" method would identify 
a subset of the pixels that is most predictive of digit-type. In contrast, our 
method identifies a subset of the images that, in a sense, is most predictive 
of digit-type. Figure 6 shows the first 88 prototypes selected by our method. 
It aims to select prototypes that capture the full variability of a class while 
avoiding confusion with other classes. For example, it chooses a wide enough 
range of examples of the digit "7" to demonstrate that some people add 
a serif while others do not; however, it avoids any "7" examples that look 
too much like a "1." We see that many more "0" examples have been chosen 
than "1" examples despite the fact that the original training set has roughly 
the same number of samples of these two classes. This reflects the fact that 
there is much more variability in how people write "0" than "1." 

More generally, suppose we are given a training set of points 
$\mathcal{X} = \{x_1, \dots, x_n\} \subset \mathbb{R}^p$ with corresponding class labels 
$y_1, \dots, y_n \in \{1, \dots, L\}$. The output of our method is a collection of 
prototype sets $\mathcal{P}_l \subseteq \mathcal{X}$, one for each class $l$. The goal is that 
someone given only $\mathcal{P}_1, \dots, \mathcal{P}_L$ would have a good sense of the original 
training data, $\mathcal{X}$ and $y$. The above situation describes the standard setting of 
a condensation problem [Hart (1968), Lozano et al. (2006), Ripley (2005)]. 

At the heart of our proposed method is the premise that the prototypes 
of class $l$ should consist of points that are close to many training points of 
class $l$ and are far from training points of other classes. This idea captures 
the sense in which the word "prototypical" is commonly used. 

Besides their interpretative value, prototypes also provide a means 
for classification. Given the prototype sets $\mathcal{P}_1, \dots, \mathcal{P}_L$, we may classify any 
new $x \in \mathbb{R}^p$ according to the class whose $\mathcal{P}_l$ contains the nearest prototype: 

$$\hat{c}(x) = \arg\min_{l} \min_{z \in \mathcal{P}_l} d(x, z). \tag{1}$$

Notice that this classification rule reduces to one nearest neighbor (1-NN) 
in the case that $\mathcal{P}_l$ consists of all $x_i \in \mathcal{X}$ with $y_i = l$. 
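
As a minimal sketch of the nearest-prototype rule just described, assuming Euclidean distance and a hypothetical dictionary `prototypes` that maps each class label to its list of prototype points (the function names are illustrative and not part of the paper):

```python
import math

def euclidean(a, b):
    """Euclidean distance; any other dissimilarity d(x, z) could be substituted."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def classify(x, prototypes, d=euclidean):
    """Return the label of the class whose prototype set holds the prototype nearest to x.

    prototypes: dict mapping class label -> list of prototype points (assumed layout).
    Classes with an empty prototype set can never be predicted.
    """
    best_label, best_dist = None, math.inf
    for label, points in prototypes.items():
        for z in points:
            dist = d(x, z)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label

# Toy prototype sets (hypothetical values).
prototypes = {1: [(0.0, 0.0), (1.0, 0.0)], 2: [(5.0, 5.0)]}
print(classify((0.8, 0.3), prototypes))
print(classify((4.0, 4.5), prototypes))
```

If every training point of a class is placed in that class's prototype list, this function behaves as an ordinary 1-NN classifier.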

The 1-NN rule's popularity stems from its conceptual simplicity, empiri- 
cally good performance, and theoretical properties [Cover and Hart (1967)]. 
Nearest prototype methods seek a lighter-weight representation of the training 
set that does not sacrifice (and, in fact, may improve) the accuracy of the 
classifier. As a classifier, our method performs reasonably well, although its 
main strengths lie in the ease of understanding why a given prediction has 
been made — an alternative to (possibly high-accuracy) "black box" meth- 
ods. 

In Section 2 we begin with a conceptually simple optimization criterion 
that describes a desirable choice for $\mathcal{P}_1, \dots, \mathcal{P}_L$. This intuition gives rise 
to an integer program, which can be decoupled into L separate set cover 
problems. In Section 3 we present two approximation algorithms for solving 
the optimization problem. Section 4 discusses considerations for applying our 
method most effectively to a given data set. In Section 5 we give an overview 
of related work. In Section 6 we return to the ZIP code digits data set and 
present other empirical results, including an application to proteomics. 

2. Formulation as an optimization problem. In this section we frame 
prototype selection as an optimization problem. The problem's connection 
to set cover will lead us naturally to an algorithm for prototype selection. 

2.1. The intuition. Our guiding intuition is that a good set of prototypes 
for class $l$ should capture the full structure of the training examples of 
class $l$ while taking into consideration the structure of other classes. More 
explicitly, every training example should have a prototype of its same class 
in its neighborhood; no point should have a prototype of a different class in 
its neighborhood; and, finally, there should be as few prototypes as possible. 
These three principles capture what we mean by "prototypical." Our method 
seeks prototype sets with a slightly relaxed version of these properties. 

Fig. 1. Given a value for $\varepsilon$, the choice of $\mathcal{P}_1, \dots, \mathcal{P}_L$ induces $L$ partial covers of the 
training points by $\varepsilon$-balls. Here $\varepsilon$ is varied from the smallest (top-left panel) to approximately 
the median interpoint distance (bottom-right panel). 

As a first step, we make the notion of neighborhood more precise. For 
a given choice of $\mathcal{P}_l \subseteq \mathcal{X}$, we consider the set of $\varepsilon$-balls centered at each 
$x_j \in \mathcal{P}_l$ (see Figure 1). A desirable prototype set for class $l$ is then one that 
induces a set of balls which: 

(a) covers as many training points of class $l$ as possible, 

(b) covers as few training points as possible of classes other than $l$, and 

(c) is sparse (i.e., uses as few prototypes as possible for the given $\varepsilon$). 

We have thus translated our initial problem concerning prototypes into 
the geometric problem of selectively covering points with a specified set of 
balls. We will show that our problem reduces to the extensively studied set 
cover problem. We briefly review set cover before proceeding with a more 
precise statement of our problem. 

2.2. The set cover integer program. Given a set of points $\mathcal{X}$ and a 
collection of sets that forms a cover of $\mathcal{X}$, the set cover problem seeks the 
smallest subcover of $\mathcal{X}$. Consider the following special case: Let 
$B(x) = \{x' \in \mathbb{R}^p : d(x', x) < \varepsilon\}$ denote the ball of radius $\varepsilon > 0$ centered at $x$ 
(note: $d$ need not be a metric). Clearly, $\{B(x_i) : x_i \in \mathcal{X}\}$ is a cover of $\mathcal{X}$. The goal is 
to find the smallest subset of points $\mathcal{P} \subseteq \mathcal{X}$ such that $\{B(x_j) : x_j \in \mathcal{P}\}$ covers 
$\mathcal{X}$ (i.e., every $x_i \in \mathcal{X}$ is within $\varepsilon$ of some point in $\mathcal{P}$). This problem can 
be written as an integer program by introducing indicator variables: $\alpha_j = 1$ 
if $x_j \in \mathcal{P}$ and $\alpha_j = 0$ otherwise. Using this notation, $\sum_{j : x_i \in B(x_j)} \alpha_j$ counts 
the number of times $x_i$ is covered by a $B(x_j)$ with $x_j \in \mathcal{P}$. Thus, requiring 
that this sum be positive for each $x_i \in \mathcal{X}$ enforces that $\mathcal{P}$ induces a cover 
of $\mathcal{X}$. The set cover problem is therefore equivalent to the following integer 
program: 

$$\text{minimize} \ \sum_{j=1}^{n} \alpha_j \quad \text{s.t.} \quad \sum_{j : x_i \in B(x_j)} \alpha_j \ge 1 \ \ \forall x_i \in \mathcal{X}, \qquad \alpha_j \in \{0, 1\} \ \ \forall x_j \in \mathcal{X}. \tag{2}$$

A feasible solution to the above integer program is one that has at least one 
prototype within $\varepsilon$ of each training point. 
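
The feasibility condition can be checked directly. The following is a small sketch, assuming points on the real line, absolute difference as the dissimilarity, and hypothetical helper names `ball_members` and `is_cover`; it tests whether a 0/1 vector of indicators induces a cover:

```python
def ball_members(points, d, eps):
    """covers[j] is the set of indices i with d(points[i], points[j]) < eps."""
    n = len(points)
    return [{i for i in range(n) if d(points[i], points[j]) < eps}
            for j in range(n)]

def is_cover(alpha, covers):
    """True if every point lies in at least one selected ball (alpha[j] == 1)."""
    covered = set()
    for j, a in enumerate(alpha):
        if a == 1:
            covered |= covers[j]
    return len(covered) == len(covers)

# Toy data on the real line (hypothetical values) with eps = 1.5.
points = [0.0, 1.0, 2.0, 10.0]
covers = ball_members(points, lambda a, b: abs(a - b), 1.5)
print(is_cover([0, 1, 0, 1], covers))  # balls centered at 1.0 and 10.0
print(is_cover([1, 0, 0, 1], covers))  # the point 2.0 is left uncovered
```

Because the balls are defined purely through `d`, the same check applies to any dissimilarity, metric or not.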

Set cover can be seen as a clustering problem in which we wish to find the 
smallest number of clusters such that every point is within $\varepsilon$ of at least one 
cluster center. In the language of vector quantization, it seeks the smallest 
codebook (restricted to $\mathcal{X}$) such that no vector is distorted by more than $\varepsilon$ 
[Tipping and Schölkopf (2001)]. It was the use of set cover in this context 
that was the starting point for our work in developing a prototype method 
in the classification setting. 

2.3. From intuition to integer program. We now express the three 
properties (a)-(c) in Section 2.1 as an integer program, taking as a starting point 
the set cover problem of (2). Property (b) suggests that in certain cases it 
may be necessary to leave some points of class $l$ uncovered. For this reason, 
we adopt a prize-collecting set cover framework for our problem, meaning 
we assign a cost to each covering set, a penalty for being uncovered to each 
point and then find the minimum-cost partial cover [Könemann, Parekh and 
Segev (2006)]. Let $\alpha_j^{(l)} \in \{0, 1\}$ indicate whether we choose $x_j$ to be in $\mathcal{P}_l$ (i.e., 
to be a prototype for class $l$). As with set cover, the sum $\sum_{j : x_i \in B(x_j)} \alpha_j^{(l)}$ 
counts the number of balls $B(x_j)$ with $x_j \in \mathcal{P}_l$ that cover the point $x_i$. We 
then set out to solve the following integer program: 

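$$\underset{\alpha_j^{(l)},\, \xi_i,\, \eta_i}{\text{minimize}} \ \sum_{i} \xi_i + \sum_{i} \eta_i + \lambda \sum_{j,l} \alpha_j^{(l)} \quad \text{s.t.}$$

$$\sum_{j : x_i \in B(x_j)} \alpha_j^{(y_i)} \ge 1 - \xi_i \quad \forall x_i \in \mathcal{X}, \tag{3a}$$

$$\sum_{\substack{j : x_i \in B(x_j) \\ l \ne y_i}} \alpha_j^{(l)} \le 0 + \eta_i \quad \forall x_i \in \mathcal{X}, \tag{3b}$$

$$\alpha_j^{(l)} \in \{0, 1\} \ \ \forall j, l, \qquad \xi_i, \eta_i \ge 0 \ \ \forall i.$$
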
We have introduced two slack variables, $\xi_i$ and $\eta_i$, per training point $x_i$. 
Constraint (3a) enforces that each training point be covered by at least one ball of 
its own class-type (otherwise $\xi_i = 1$). Constraint (3b) expresses the condition 
that training point $x_i$ not be covered with balls of other classes (otherwise 
$\eta_i > 0$). In particular, $\xi_i$ can be interpreted as indicating whether $x_i$ does 
not fall within $\varepsilon$ of any prototypes of class $y_i$, and $\eta_i$ counts the number of 
prototypes of class other than $y_i$ that are within $\varepsilon$ of $x_i$. 

Finally, $\lambda \ge 0$ is a parameter specifying the cost of adding a prototype. 
Its effect is to control the number of prototypes chosen [corresponding to 
property (c) of the last section]. We generally choose $\lambda = 1/n$, so that 
property (c) serves only as a "tie-breaker" for choosing among multiple solutions 
that do equally well on properties (a) and (b). Hence, in words, we are 
minimizing the sum of (a) the number of points left uncovered, (b) the number 
of times a point is wrongly covered, and (c) the number of covering balls 
(multiplied by $\lambda$). The resulting method has a single tuning parameter, $\varepsilon$ 
(the ball radius), which can be estimated by cross-validation. 
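
To make the three terms of the objective concrete, here is a small sketch that evaluates it for a given choice of prototypes. It assumes prototypes are given as index lists into the training set (a hypothetical layout), and the function name `objective` is illustrative:

```python
def objective(points, labels, prototypes, d, eps, lam):
    """Evaluate the three-term objective for fixed prototype sets.

    prototypes: dict class label -> list of indices j into points (assumed layout).
    Returns (total, n_uncovered, n_wrong_covers, n_prototypes).
    """
    n_uncovered = 0     # term (a): points with no own-class prototype within eps
    n_wrong = 0         # term (b): number of times a point is wrongly covered
    for x, y in zip(points, labels):
        own = sum(1 for j in prototypes.get(y, []) if d(x, points[j]) < eps)
        other = sum(1 for l, idx in prototypes.items() if l != y
                    for j in idx if d(x, points[j]) < eps)
        if own == 0:
            n_uncovered += 1
        n_wrong += other
    n_proto = sum(len(idx) for idx in prototypes.values())   # term (c)
    total = n_uncovered + n_wrong + lam * n_proto
    return total, n_uncovered, n_wrong, n_proto

# Toy one-dimensional data (hypothetical values), lam = 1/n.
points = [0.0, 1.0, 2.0, 3.0, 10.0]
labels = [1, 1, 2, 2, 2]
result = objective(points, labels, {1: [0], 2: [3]},
                   lambda a, b: abs(a - b), 1.5, 1 / len(points))
print(result)
```

In this toy example the point at 10.0 is left uncovered and no point is wrongly covered, so with $\lambda = 1/n$ the prototype count only perturbs the total by less than one unit.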

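We show in the Appendix that the above integer program is equivalent 
to $L$ separate prize-collecting set cover problems. Let $\mathcal{X}_l = \{x_i \in \mathcal{X} : y_i = l\}$. 
Then, for each class $l$, the set $\mathcal{P}_l \subseteq \mathcal{X}$ is given by the solution to 

$$\text{minimize} \ \sum_{j=1}^{m} C_l(j)\, \alpha_j^{(l)} + \sum_{x_i \in \mathcal{X}_l} \xi_i \quad \text{s.t.}$$

$$\sum_{j : x_i \in B(x_j)} \alpha_j^{(l)} \ge 1 - \xi_i \quad \forall x_i \in \mathcal{X}_l, \tag{4}$$

$$\alpha_j^{(l)} \in \{0, 1\} \ \ \forall j, \qquad \xi_i \ge 0 \ \ \forall i : x_i \in \mathcal{X}_l,$$

where $C_l(j) = \lambda + |B(x_j) \cap (\mathcal{X} \setminus \mathcal{X}_l)|$ is the cost of adding $x_j$ to $\mathcal{P}_l$ and a unit 
penalty is charged for each point $x_i$ of class $l$ left uncovered. 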
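
The origin of the cost of adding a prototype can be seen in a brief sketch of the decoupling argument. This outline is not the Appendix proof; it only assumes that each wrong-cover slack is tight at the optimum, which holds because that slack is minimized and is bounded below only by the count of wrong-class balls covering the point:

```latex
\begin{align*}
\sum_{i} \eta_i
  &= \sum_{i} \sum_{l \neq y_i} \sum_{j : x_i \in B(x_j)} \alpha_j^{(l)}
   = \sum_{l} \sum_{j} \alpha_j^{(l)} \,
     \bigl| B(x_j) \cap (\mathcal{X} \setminus \mathcal{X}_l) \bigr|, \\
\sum_{i} \xi_i + \sum_{i} \eta_i + \lambda \sum_{j,l} \alpha_j^{(l)}
  &= \sum_{l} \Bigl[ \sum_{j}
     \bigl( \lambda + \bigl| B(x_j) \cap (\mathcal{X} \setminus \mathcal{X}_l) \bigr| \bigr)
     \alpha_j^{(l)}
     + \sum_{x_i \in \mathcal{X}_l} \xi_i \Bigr].
\end{align*}
```

The first line swaps the order of summation: each class-$l$ prototype at $x_j$ contributes one unit for every training point in its ball that is not of class $l$. Since the covering constraint for a point of class $l$ involves only the class-$l$ indicators, each bracketed term can be minimized separately.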

3. Solving the problem: Two approaches. The prize-collecting set cover 
problem of (4) can be transformed to a standard set cover problem by 
considering each slack variable $\xi_i$ as representing a singleton set of unit cost 
[Könemann, Parekh and Segev (2006)]. Since set cover is NP-hard, we do 
not expect to find a polynomial-time algorithm to solve our problem exactly. 
Further, certain inapproximability results have been proven for the set cover 
problem [Feige (1998)] (see footnote 3). In what follows, we present two algorithms for 
approximately solving our problem, both based on standard approximation 
algorithms for set cover. 
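
The transformation to a standard (weighted) set cover instance can be sketched as follows. The data layout and the name `to_weighted_set_cover` are assumptions made for illustration: each candidate ball becomes a set restricted to the class's own points, and each slack variable becomes a unit-cost singleton set:

```python
def to_weighted_set_cover(class_idx, covers, costs):
    """Build the weighted set cover instance behind one class's problem.

    class_idx: indices i of the points in the class (the universe to cover).
    covers[j]: set of all indices i whose point lies in the ball around point j.
    costs[j]:  cost of selecting point j as a prototype for this class.
    Returns (universe, sets), where sets is a list of (name, members, cost).
    """
    universe = set(class_idx)
    sets = []
    for j, members in enumerate(covers):
        sets.append((("ball", j), members & universe, costs[j]))
    for i in class_idx:
        # Leaving point i uncovered is modeled as buying a singleton set of cost 1.
        sets.append((("slack", i), {i}, 1.0))
    return universe, sets

# Toy instance (hypothetical values): five candidate balls, class points 0 and 1.
covers = [{0, 1}, {0, 1, 2}, {1, 2, 3}, {2, 3}, {4}]
costs = [0.2, 1.2, 2.2, 2.2, 1.2]
universe, sets = to_weighted_set_cover([0, 1], covers, costs)
print(len(sets))
```

Any standard weighted set cover routine can then be run on `sets`; the singleton sets guarantee that a feasible cover always exists.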

3.1. LP relaxation with randomized rounding. A well-known approach 
for the set cover problem is to relax the integer constraints $\alpha_j^{(l)} \in \{0, 1\}$ by 
replacing them with $0 \le \alpha_j^{(l)} \le 1$. The result is a linear program (LP), which 
is convex and easily solved with any LP solver. The LP solution is subsequently 
rounded to recover a feasible (though not necessarily optimal) solution to 
the original integer program. 

Footnote 3: We do not assume in general that the dissimilarities satisfy the triangle 
inequality, so we consider arbitrary covering sets. 

Let $\{\alpha_1^{*(l)}, \dots, \alpha_m^{*(l)}\} \cup \{\xi_i^* : i \text{ s.t. } x_i \in \mathcal{X}_l\}$ denote a solution to the LP 
relaxation of (4) with optimal value $\mathrm{OPT}_{\mathrm{LP}}^{(l)}$. Since $\alpha_j^{*(l)}, \xi_i^* \in [0, 1]$, we may 
think of these as probabilities and round each variable to 1 with probability 
given by its value in the LP solution. Following Vazirani (2001), we do this 
$O(\log|\mathcal{X}_l|)$ times and take the union of the partial covers from all iterations. 

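We apply this randomized rounding technique to approximately solve (4) 
for each class separately. For class $l$, the rounding algorithm is as follows: 

• Initialize $A_1^{(l)} = \dots = A_m^{(l)} = 0$ and $S_i = 0 \ \forall i : x_i \in \mathcal{X}_l$. 

• For $t = 1, \dots, 2\log|\mathcal{X}_l|$: 

(1) Draw independently $\tilde{A}_j^{(l)} \sim \mathrm{Bernoulli}(\alpha_j^{*(l)})$ and $\tilde{S}_i \sim \mathrm{Bernoulli}(\xi_i^*)$. 

(2) Update $A_j^{(l)} := \max(A_j^{(l)}, \tilde{A}_j^{(l)})$ and $S_i := \max(S_i, \tilde{S}_i)$. 

• If $\{A_j^{(l)}, S_i\}$ is feasible and has objective $\le 2\log|\mathcal{X}_l| \, \mathrm{OPT}_{\mathrm{LP}}^{(l)}$, return 
$\mathcal{P}_l = \{x_j \in \mathcal{X} : A_j^{(l)} = 1\}$. Otherwise repeat. 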
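
The rounding steps above can be sketched as follows for a single class. This is an illustration under stated assumptions rather than the authors' implementation: the fractional LP solution is taken as given (no LP solver is called), the logarithm is taken to be natural, the class is assumed to have at least two points, and the names `round_once` and `randomized_rounding` are hypothetical:

```python
import math
import random

def round_once(alpha_star, xi_star, n_rounds, rng):
    """Union of n_rounds independent Bernoulli roundings of the LP solution."""
    A = [0] * len(alpha_star)
    S = {i: 0 for i in xi_star}
    for _ in range(n_rounds):
        for j, p in enumerate(alpha_star):
            A[j] = max(A[j], int(rng.random() < p))
        for i, p in xi_star.items():
            S[i] = max(S[i], int(rng.random() < p))
    return A, S

def randomized_rounding(alpha_star, xi_star, covers, costs, opt_lp, rng,
                        max_tries=1000):
    """Repeat the rounding until it is feasible and within the 2*log bound.

    alpha_star: fractional values for the candidate prototypes.
    xi_star:    dict i -> fractional slack for each point i of the class.
    covers[j]:  set of class-point indices lying in the ball around candidate j.
    costs[j]:   cost of selecting candidate j; opt_lp: the LP optimal value.
    """
    factor = 2 * math.log(len(xi_star))      # assumes at least two class points
    n_rounds = max(1, math.ceil(factor))
    for _ in range(max_tries):
        A, S = round_once(alpha_star, xi_star, n_rounds, rng)
        feasible = all(
            S[i] == 1 or any(A[j] and i in covers[j] for j in range(len(A)))
            for i in xi_star)
        value = sum(c * a for c, a in zip(costs, A)) + sum(S.values())
        if feasible and value <= factor * opt_lp:
            return [j for j, a in enumerate(A) if a == 1], S
    raise RuntimeError("no acceptable rounding found")

# Toy fractional solution (hypothetical): two interchangeable candidates.
protos, slack = randomized_rounding([0.5, 0.5], {0: 0.0, 1: 0.0},
                                    [{0, 1}, {0, 1}], [1.0, 1.0], 1.0,
                                    random.Random(0))
print(protos, slack)
```

The `max_tries` guard is an addition for safety in this sketch; the description above simply repeats until an acceptable rounding is found.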



In practice, we terminate as soon as a feasible solution is achieved. If after 
$2\log|\mathcal{X}_l|$ steps the solution is still infeasible or the objective of the rounded 
solution is more than $2\log|\mathcal{X}_l|$ times the LP objective, then the algorithm 
is repeated. By the analysis given in Vazirani (2001), the probability of this 
happening is less than 1/2, so it is unlikely that we will have to repeat 
the above algorithm very many times. Recalling that the LP relaxation 
gives a lower bound on the integer program's optimal value, we see that 
the randomized rounding yields an $O(\log|\mathcal{X}_l|)$-factor approximation to (4). 
Doing this for each class yields overall an $O(K \log N)$-factor approximation 
to (3), where $N = \max_l |\mathcal{X}_l|$. We can recover the rounded version of the slack 
variable $\eta_i$ by $T_i = \sum_{l \ne y_i} \sum_{j : x_i \in B(x_j)} A_j^{(l)}$. 

One disadvantage of this approach is that it requires solving an LP, which 
we have found can be relatively slow and memory-intensive for large data 
sets. The approach we describe next is computationally easier than the LP 
rounding method, is deterministic, and provides a natural ordering of the 
prototypes. It is thus our preferred method. 

3.2. A greedy approach. Another well-known approximation algorithm 
for set cover is a greedy approach [Vazirani (2001)]. At each step, the 
prototype with the least ratio of cost to number of points newly covered is 
added. However, here we present a less standard greedy algorithm which 
has certain practical advantages over the standard one and does not in our 
experience do noticeably worse in minimizing the objective. At each step we 
find the $x_j \in \mathcal{X}$ and class $l$ for which adding $x_j$ to $\mathcal{P}_l$ has the best trade-off 
of covering previously uncovered training points of class $l$ while avoiding 
covering points of other classes. The incremental improvement of going 
from $(\mathcal{P}_1, \dots, \mathcal{P}_L)$ to $(\mathcal{P}_1, \dots, \mathcal{P}_{l-1}, \mathcal{P}_l \cup \{x_j\}, \mathcal{P}_{l+1}, \dots, \mathcal{P}_L)$ can be denoted 
by $\Delta\mathrm{Obj}(x_j, l) = \Delta\xi(x_j, l) - \Delta\eta(x_j, l) - \lambda$, where 

$$\Delta\xi(x_j, l) = \Bigl| \mathcal{X}_l \cap \Bigl( B(x_j) \setminus \bigcup_{x_{j'} \in \mathcal{P}_l} B(x_{j'}) \Bigr) \Bigr|,$$

$$\Delta\eta(x_j, l) = \bigl| B(x_j) \cap (\mathcal{X} \setminus \mathcal{X}_l) \bigr|.$$
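The greedy algorithm is simply as follows: 

(1) Start with $\mathcal{P}_l = \emptyset$ for each class $l$. 

(2) While $\Delta\mathrm{Obj}(x^*, l^*) > 0$: 

• Find $(x^*, l^*) = \arg\max_{(x_j, l)} \Delta\mathrm{Obj}(x_j, l)$. 

• Let $\mathcal{P}_{l^*} := \mathcal{P}_{l^*} \cup \{x^*\}$. 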
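
The greedy steps above can be sketched in a few lines. This is an illustrative, unoptimized version, assuming the candidate prototypes are the training points themselves and a nonnegative prototype cost `lam`; the function name `greedy_prototypes` and the tie-breaking order (first candidate found) are choices made here, not taken from the paper:

```python
def greedy_prototypes(points, labels, d, eps, lam):
    """Greedy prototype selection; returns (prototypes, order).

    prototypes: dict class label -> list of selected indices into points.
    order:      list of (index, label) pairs in the order they were added.
    Assumes lam >= 0, so every accepted step covers at least one new point.
    """
    n = len(points)
    classes = sorted(set(labels))
    # ball[j]: indices of all training points within eps of point j
    ball = [{i for i in range(n) if d(points[i], points[j]) < eps}
            for j in range(n)]
    members = {l: {i for i in range(n) if labels[i] == l} for l in classes}
    covered = {l: set() for l in classes}    # class-l points already covered
    prototypes = {l: [] for l in classes}
    order = []
    while True:
        best, best_gain = None, 0.0
        for l in classes:
            for j in range(n):
                newly = len((ball[j] & members[l]) - covered[l])
                wrong = len(ball[j] - members[l])
                gain = newly - wrong - lam   # incremental improvement
                if gain > best_gain:
                    best, best_gain = (j, l), gain
        if best is None:                     # no step improves the objective
            break
        j, l = best
        prototypes[l].append(j)
        covered[l] |= ball[j] & members[l]
        order.append(best)
    return prototypes, order

# Toy one-dimensional data (hypothetical values).
points = [0.0, 1.0, 2.0, 3.0, 10.0]
labels = [1, 1, 2, 2, 2]
protos, order = greedy_prototypes(points, labels,
                                  lambda a, b: abs(a - b), 1.5, 0.2)
print(protos, order)
```

The list `order` records the sequence of additions, which is the natural ordering of prototypes that the greedy approach provides.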



Figure 2 shows a performance comparison of the two approaches on the 
digits data (described in Section 6.2) based on time and resulting (integer 
program) objective. Of course, any time comparison is greatly dependent on 
the machine and implementation, and we found great variability in running 
time among LP solvers. While low-level, specialized software could lead to 
significant time gains, for our present purposes, we use off-the-shelf, high- 
level software. The LP was solved using the R package Rglpk, an interface 
to the GNU Linear Programming Kit. For the greedy approach, we wrote 
a simple function in R. 

4. Problem-specific considerations. In this section we describe two ways 
in which our method can be tailored by the user for the particular problem 
at hand. 

4.1. Dissimilarities. Our method depends on the features only through 
the pairwise dissimilarities $d(x_i, x_j)$, which allows it to share in the benefits 
of kernel methods by using a kernel-based distance. For problems in the 
$p \gg n$ realm, using distances that effectively lower the dimension can lead 
to improvements. Additionally, in problems in which the data are not readily 
embedded in a vector space (see Section 6.3), our method may still be applied 
if pairwise dissimilarities are available. Finally, given any dissimilarity $d$, we 
may instead use $\tilde{d}$, defined by $\tilde{d}(x, z) = |\{x_i \in \mathcal{X} : d(x_i, z) \le d(x, z)\}|$. Using $\tilde{d}$ 
induces $\varepsilon$-balls, $B(x_j)$, consisting of the $(\lfloor \varepsilon \rfloor - 1)$ nearest training points 
to $x_j$. 
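
A small sketch of this rank-based dissimilarity follows, assuming the count uses a non-strict inequality and that there are no ties among distances (the helper name `rank_dissimilarity` and the toy data are hypothetical):

```python
def rank_dissimilarity(train, d):
    """Return a function giving the number of training points at least as
    close to z as x is, i.e. the rank of x among the neighbors of z."""
    def d_tilde(x, z):
        ref = d(x, z)
        return sum(1 for xi in train if d(xi, z) <= ref)
    return d_tilde

# Toy training set on the real line (hypothetical values).
train = [0.0, 1.0, 3.0, 7.0]
d_tilde = rank_dissimilarity(train, lambda a, b: abs(a - b))

eps = 3.5
center = 1.0
ball = [x for x in train if d_tilde(x, center) < eps]
print(ball)
```

Here the ball around 1.0 with radius 3.5 contains the center itself plus its two nearest training points, so the radius now controls a neighbor count rather than a distance.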

4.2. Prototypes not on training points. For simplicity, up until now we 
have described a special case of our method in which we only allow 
prototypes to lie on elements of the training set $\mathcal{X}$. However, our method is easily 
generalized to the case where prototypes are selected from any finite set of 
points. In particular, suppose, in addition to the labeled training data $\mathcal{X}$ 
and $y$, we are also given a set $\mathcal{Z} = \{z_1, \dots, z_m\}$ of unlabeled points. This 
situation (known as semi-supervised learning) occurs, for example, when 
it is expensive to obtain large amounts of labeled examples, but collecting 
unlabeled data is cheap. Taking $\mathcal{Z}$ as the set of potential prototypes, the 
optimization problem (3) is easily modified so that $\mathcal{P}_1, \dots, \mathcal{P}_L$ are selected 
subsets of $\mathcal{Z}$. Doing so preserves the property that all prototypes are actual 
examples (rather than arbitrary points in $\mathbb{R}^p$). 

While having prototypes confined to lie on actual observed points is 
desirable for interpretability, if this is not desired, then $\mathcal{Z}$ may be further 
augmented to include other points. For example, one could run $K$-means 
on each class's points individually and add these $L \cdot K$ centroids to $\mathcal{Z}$. This 
method seems to help especially in high-dimensional problems where 
constraining all prototypes to lie on data points suffers from the curse of 
dimensionality. 

5. Related work. Before we proceed with empirical evaluations of our 
method, we discuss related work. There is an abundance of methods that 
have been proposed addressing the problem of how to select prototypes 
from a training set. These proposals appear in multiple fields under different 
names and with differing goals and justifications. The fact that this problem 
lies at the intersection of so many different literatures makes it difficult to 
provide a complete overview of them all. In some cases, the proposals are 
quite similar to our own, differing in minor details or reducing in a special 
case. What makes the present work different from the rest is our goal, which 
is to develop an interpretative aid for data analysts who need to make sense 
of a large set of labeled data. The details of our method have been adapted to 
this goal; however, other proposals — while perhaps intended specifically as 
a preprocessing step for the classification task — may be effectively adapted 
toward this end as well. In this section we review some of the related work 
to our own. 

5.1. Class cover catch digraphs. Priebe et al. (2003) form a directed 
graph $D_k = (\mathcal{X}_k, E_k)$ for each class $k$ where $(x_i, x_j) \in E_k$ if a ball centered 
at $x_i$ of radius $r_i$ covers $x_j$. One choice of $r_i$ is to make it as large as 
possible without covering more than a specified number of other-class points. 
A dominating set of $D_k$ is a set of nodes for which all elements of $\mathcal{X}_k$ are 
reachable by crossing no more than one edge. They use a greedy algorithm 
to find an approximation to the minimum dominating set for each $D_k$. This 
set of points is then used to form the Class Cover Catch Digraph (CCCD) 
Classifier, which is a nearest neighbor rule that scales distances by the radii. 
Noting that a dominating set of $D_k$ corresponds to finding a set of balls 
that covers all points of class $k$, we see that their method could also be 
described in terms of set cover. The main difference between their formula- 
tion and ours is that we choose a fixed radius across all points, whereas in 
their formulation a large homogeneous region is filled by a large ball. Our 
choice of fixed radius seems favorable from an interpretability standpoint 
since there can be regions of space which are class-homogeneous and yet for 
which there is a lot of interesting within-class variability which the proto- 
types should reveal. The CCCD work is an outgrowth of the Class Cover 
Problem, which does not allow balls to cover wrong-class points [Cannon and 
Cowen (2004)]. This literature has been developed in more theoretical direc- 
tions [e.g., DeVinney and Wierman (2002), Ceyhan, Priebe and Marchette 
(2007)]. 

5.2. The set covering machine. Marchand and Shawe-Taylor (2002) 
introduce the set covering machine (SCM) as a method for learning compact 
disjunctions (or conjunctions) of $x$ in the binary classification setting (i.e., 
when $L = 2$). That is, given a potentially large set of binary functions of 
the features, $\mathcal{H} = \{h_j, j = 1, \dots, m\}$ where $h_j : \mathbb{R}^p \to \{0, 1\}$, the SCM 
selects a relatively small subset of functions, $\mathcal{R} \subseteq \mathcal{H}$, for which the prediction 
rule $f(x) = \bigvee_{j \in \mathcal{R}} h_j(x)$ (in the case of a disjunction) has low training 
error. Although their stated problem is unrelated to ours, the form of the 
optimization problem is very similar. 

In Hussain, Szedmak and Shawe-Taylor (2004) the authors express the 
SCM optimization problem explicitly as an integer program, where the 
binary vector $\alpha$ is of length $m$ and indicates which of the $h_j$ are in $\mathcal{R}$: 

$$\underset{\alpha,\, \xi,\, \eta}{\text{minimize}} \ \sum_{j=1}^{m} \alpha_j + D \Bigl( \sum_{i} \xi_i + \sum_{i} \eta_i \Bigr) \quad \text{s.t.} \quad H_+ \alpha \ge 1 - \xi, \quad H_- \alpha \le 0 + \eta, \quad \alpha \in \{0, 1\}^m; \ \xi, \eta \ge 0. \tag{5}$$

In the above integer program (for the disjunction case), $H_+$ is the matrix 
with $ij$th entry $h_j(x_i)$, with each row $i$ corresponding to a "positive" 
example $x_i$ and $H_-$ the analogous matrix for "negative" examples. Disregarding 
the slack vectors $\xi$ and $\eta$, this seeks the binary vector $\alpha$ for which every 
positive example is covered by at least one $h_j \in \mathcal{R}$ and for which no 
negative example is covered by any $h_j \in \mathcal{R}$. The presence of the slack variables 
permits a certain number of errors to be made on the training set, with the 
trade-off between accuracy and size of $\mathcal{R}$ controlled by the parameter $D$. 

A particular choice for $\mathcal{H}$ is also suggested in Marchand and Shawe-Taylor 
(2002), which they call "data-dependent balls," consisting of indicator 
functions for the set of all balls with centers at "positive" $x_i$ (and of all radii) 
and the complement of all balls centered at "negative" $x_i$. 

Clearly, the integer programs (3) and (5) are very similar. If we take $\mathcal{H}$ 
to be the set of balls of radius $\varepsilon$ with centers at the positive points only, 
solving (5) is equivalent to finding the set of prototypes for the positive 
class using our method. As shown in the Appendix, (3) decouples into $L$ 
separate problems. Each of these is equivalent to (5) with the positive and 
negative classes being $\mathcal{X}_l$ and $\mathcal{X} \setminus \mathcal{X}_l$, respectively. Despite this 
correspondence, Marchand and Shawe-Taylor (2002) were not considering the problem 
of prototype selection in their work. Since the goal of Marchand and Shawe-Taylor 
(2002) was to learn a conjunction (or disjunction) of binary features, 
they take as a classification rule $f(x)$; since our aim is a set of prototypes, 
it is natural that we use the standard nearest-prototype classification rule 
of (1). 

For solving the SCM integer program, Hussain, Szedmak and Shawe- 
Taylor (2004) propose an LP relaxation; however, a key difference between 
their approach and ours is that they do not seek an integer solution (as we 
do with the randomized rounding), but rather modify the prediction rule to 
make use of the fractional solution directly. 

Marchand and Shawe-Taylor (2002) propose a greedy approach to 
solving (5). Our greedy algorithm differs slightly from theirs in the following 
respect. In their algorithm, once a point is misclassified by a feature, no 
further penalty is incurred for other features also misclassifying it. In 
contrast, in our algorithm, a prototype is always charged if it falls within $\varepsilon$ of 
a wrong-class training point. This choice is truer to the integer programs (3) 
and (5) since the objective has $\sum_i \eta_i$ rather than $\sum_i 1\{\eta_i > 0\}$. 

5.3. Condensation and instance selection methods. Our method (with 
$\mathcal{Z} = \mathcal{X}$) selects a subset of the original training set as prototypes. In this 
sense, it is similar in spirit to condensing and data editing methods, such 
as the condensed nearest neighbor rule [Hart (1968)] and multiedit [Devijver 
and Kittler (1982)]. Hart (1968) introduces the notion of the minimal 
consistent subset, that is, the smallest subset of $\mathcal{X}$ for which nearest-prototype 
classification has 0 training error. Our method's objective, 
$\sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \eta_i + \lambda \sum_{j,l} \alpha_j^{(l)}$, represents a sort of compromise, 
governed by $\lambda$, between consistency (first two terms) and minimality 
(third term). In contrast to our 
method, which retains examples from the most homogeneous regions, con- 
densation methods tend to specifically keep those elements that fall on the 
boundary between classes [Fayed and Atiya (2009)]. This difference high- 
lights the distinction between the goals of reducing a data set for good clas- 
sification performance versus creating a tool for interpreting a data set. Wil- 
son and Martinez (2000) provide a good survey of instance-based learning, 
focusing — as is typical in this domain — entirely on its ability to improve the 
efficiency and accuracy of classification rather than discussing its attractive- 
ness for understanding a data set. More recently, Cano, Herrera and Lozano 
(2007) use evolutionary algorithms to perform instance selection with the 
goal of creating decision trees that are both precise and interpretable, and 
Marchiori (2010) suggests an instance selection technique focused on having 
a large hypothesis margin. Cano, Herrera and Lozano (2003) compare the 
performance of a number of instance selection methods. 

5.4. Other methods. We also mention a few other nearest prototype 
methods. $K$-means and $K$-medoids are common unsupervised methods which 
produce prototypes. Simply running these methods on each class separately 
yields prototype sets $\mathcal{P}_1, \dots, \mathcal{P}_L$. $K$-medoids is similar to our method in that 
its prototypes are selected from a finite set. In contrast, $K$-means's 
prototypes are not required to lie on training points, making the method adaptive. 
While allowing prototypes to lie anywhere in $\mathbb{R}^p$ can improve classification 
error, it also reduces the interpretability of the prototypes (e.g., in data 
sets where each $x_i$ represents an English word, producing a linear 
combination of hundreds of words offers little interpretative value). Probably the 
most widely used adaptive prototype method is learning vector quantization 
[LVQ, Kohonen (2001)]. Several versions of LVQ exist, varying in certain 
details, but each begins with an initial set of prototypes and then iteratively 
adjusts them in a fashion that tends to encourage each prototype to lie near 
many training points of its class and away from training points of other 
classes. 

Takigawa, Kudo and Nakamura (2009) propose an idea similar to ours 
in which they select convex sets to represent each class, and then make 
predictions for new points by finding the set with nearest boundary. They 
refer to the selected convex sets themselves as prototypes. 

Finally, in the main example of this paper (Section 6.2), we observe that 
the relative proportion of prototypes selected for each class reveals that 
certain classes are far more complex than others. We note here that quan- 
tifying the complexity of a data set is itself a subject that has been studied 
extensively [Basu and Ho (2006)]. 

6. Examples on simulated and real data. We demonstrate the use of our 
method on several data sets and compare its performance as a classifier to 
some of the prototype methods best known to statisticians. Classification 
error is a convenient metric for demonstrating that our proposal is 
reasonable even though building a classifier is not our focus. All the methods we 
include are similar in that they first choose a set of prototypes and then use 
the nearest-prototype rule to classify. LVQ and $K$-means differ from the rest 
in that they do not constrain the prototypes to lie on actual elements of the 
training set (or any prespecified finite set $\mathcal{Z}$). We view this flexibility as a 
hindrance for interpretability but a potential advantage for classification error. 

For $K$-medoids, we run the function pam of the R package cluster on each 
class's data separately, producing $K$ prototypes per class. For LVQ, we use 
the functions lvqinit and olvq1 [optimized learning vector quantization 1, 
Kohonen (2001)] from the R package class. We vary the initial codebook 
size to produce a range of solutions. 

6.1. Mixture of Gaussians simulation. For demonstration purposes, we 
consider a three-class example with $p = 2$. Each class was generated as a 
mixture of 10 Gaussians. Figure 1 shows our method's solution for a range of 
values of the tuning parameter $\varepsilon$. In Figure 3 we display the classification 
boundaries of a number of methods. Our method (which we label as "PS," 
for prototype selection) and LVQ succeed in capturing the shape of the 
boundary, whereas $K$-medoids has an erratic boundary; it does not perform 
well when classes overlap since it does not take into account other classes 
when choosing prototypes. 
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Fig. 3. Mixture of Gaussians. Classification boundaries of Bayes, our method (PS), 
K-medoids and LVQ (Bayes boundary in gray for comparison) . 
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Fig. 4. Digits data set. Left: all methods use Euclidean distance and allow prototypes 
to lie off of training points (except for K -medoids) . Right: both use tangent distance and 
constrain prototypes to lie on training points. The rightmost point on our method's curve 
(black) corresponds to 1-NN. 



6.2. ZIP code digits data. We return now to the USPS handwritten digits 
data set, which consists of a training set of n = 7,291 grayscale (16 × 16 
pixel) images of handwritten digits 0-9 (and 2,007 test images). We ran our 
method for a range of values of ε from the minimum interpoint distance (in 
which our method retains the entire training set and so reduces to 1-NN 
classification) to approximately the 14th percentile of interpoint distances. 
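
One way to build such a range of ε values is to read off empirical quantiles of the interpoint distances, from the minimum up to a chosen upper quantile. The Python sketch below is an assumption-laden illustration: the helper names pairwise_distances and epsilon_grid, the nearest-rank quantile lookup, the number of grid values and the toy points are all hypothetical, and Euclidean distance stands in for whichever dissimilarity is actually used.

```python
import math


def pairwise_distances(points):
    """Euclidean distances between all distinct pairs of points."""
    n = len(points)
    return [math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)]


def epsilon_grid(distances, upper_quantile, num_values):
    """Equally spaced empirical quantiles from the minimum (quantile 0)
    up to upper_quantile. Requires num_values >= 2."""
    ordered = sorted(distances)
    grid = []
    for k in range(num_values):
        q = upper_quantile * k / (num_values - 1)
        index = round(q * (len(ordered) - 1))  # nearest-rank style lookup
        grid.append(ordered[index])
    return grid


# Hypothetical toy data: six points on a line.
points = [(float(i), 0.0) for i in range(6)]
distances = pairwise_distances(points)
grid = epsilon_grid(distances, upper_quantile=0.14, num_values=5)
print(grid)
```

The first grid value is the minimum interpoint distance, which corresponds to the end of the range at which the method retains every training point.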

The left-hand panel of Figure 4 shows the test error as a function of 
the number of prototypes for several methods using the Euclidean metric. 
Since both LVQ and K-means can place prototypes anywhere in the feature 
space, which is advantageous in high-dimensional problems, we also 
allow our method to select prototypes that do not lie on the training points 
by augmenting Z. In this case, we run 10-means clustering on each class 
separately and then add these resulting 100 points to Z (in addition to X). 

The notion of the tangent distance between two such images was introduced 
by Simard, Le Cun and Denker (1993) to account for certain invariances 
in this problem (e.g., the thickness and orientation of a digit are not 
relevant factors when we consider how similar two digits are). Use of tangent 
distance with 1-NN attained the lowest test errors of any method [Hastie 
and Simard (1998)]. Since our method operates on an arbitrary dissimilarity 
matrix, we can easily use the tangent distance in place of the standard 
Euclidean metric. The right-hand panel of Figure 4 shows the test errors 
when tangent distance is used. K-medoids likewise accommodates 
any dissimilarity. While LVQ has been generalized to arbitrary differentiable 
metrics, there does not appear to be generic, off-the-shelf software available. 

Table 1 
Comparison of number of prototypes chosen per class to training set size 

Digit             0      1    2    3    4    5    6    7    8    9   Total 
Training set  1,194  1,005  731  658  652  556  664  645  542  644   7,291 
PS-best         493      7  661  551  324  486  217  101  378  154   3,372 


The lowest test error attained by our method is 2.49% with a 3,372-prototype 
solution (compared to 1-NN's 3.09%; see footnote 4). Of course, the minimum of the curve 
is a biased estimate of test error; however, it is reassuring to note that for 
a wide range of ε values we get a solution with test error comparable to that 
of 1-NN, but requiring far fewer prototypes. 

As stated earlier, our primary interest is in the interpretative advantage 
offered by our method. A unique feature of our method is that it automatically 
chooses the relative number of prototypes per class to use. In this 
example, it is interesting to examine the class frequencies of prototypes 
(Table 1). 

The most dramatic feature of this solution is that it only retains seven 
of the 1,005 examples of the digit 1. This reflects the fact that, relative to 
other digits, the digit 1 has the least variation when handwritten. Indeed, the 
average (tangent) distance between digit 1's in the training set is less than 
half that of any other digit (the second least variable digit is 7). Our choice 
to force all balls to have the same radius leads to the property that classes 
with greater variability acquire a larger proportion of the prototypes. By 
contrast, K-medoids requires the user to decide on the relative proportions 
of prototypes across the classes. 
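
To make the class frequencies concrete, the short Python sketch below computes, for each digit, the fraction of its training images retained as prototypes and its share of all prototypes. The counts are transcribed from Table 1; the variable names are ours and the script is only an arithmetic illustration.

```python
# Counts transcribed from Table 1 (digits 0 through 9).
training = [1194, 1005, 731, 658, 652, 556, 664, 645, 542, 644]
ps_best = [493, 7, 661, 551, 324, 486, 217, 101, 378, 154]

# Fraction of each class kept as prototypes, and each class's share of
# the full prototype set.
retained = [p / n for p, n in zip(ps_best, training)]
share = [p / sum(ps_best) for p in ps_best]

for digit, (r, s) in enumerate(zip(retained, share)):
    print(f"digit {digit}: retained {r:6.1%} of its class, "
          f"{s:5.1%} of all prototypes")
```

The digit 1 stands out with fewer than 1% of its training images retained, whereas several other digits keep more than 80% of theirs.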

Figure 5 provides a qualitative comparison between centroids from K-means 
and prototypes selected by our method. The upper panel shows the 
result of 10-means clustering within each class; the lower panel shows the 
solution of our method tuned to generate approximately 100 prototypes. 
Our prototypes are sharper and show greater variability than those from 
K-means. Both of these observations reflect the fact that the K-means images 
are averages of many training samples, whereas our prototypes are single 
original images from the training set. As observed in the 3,372-prototype 
solution, we find that the relative numbers of prototypes in each class for 
our method adapt to the within-class variability. 

Figure 6 shows images of the first 88 prototypes (of 3,372) selected by the 
greedy algorithm. Above each image is the number of training images previously 
uncovered that were correctly covered by the addition of this prototype 
and, in parentheses, the number of training points that are miscovered by 
this prototype. For example, we can see that the first prototype selected by 
the greedy algorithm, which was a "1," covered 986 training images of 1's 
and four training images that were not of 1's. Figure 7 displays these in 
a more visually descriptive way: we have used multidimensional scaling to 
arrange the prototypes to reflect the tangent distances between them. 
Furthermore, the size of each prototype is proportional to the log of the number 
of training images correctly covered by it. Figure 8 shows a complete-linkage 
hierarchical clustering of the training set with images of the 88 prototypes. 
Figures 6-8 demonstrate ways in which prototypes can be used to graphically 
summarize a data set. These displays could be easily adapted to other 
domains, for example, by using gene names in place of the images. 

4 Hastie and Simard (1998) report a 2.6% test error for 1-NN on this data set. The 
difference may be due to implementation details of the tangent distance. 

Fig. 5. (Top) centroids from 10-means clustering within each class. (Bottom) prototypes 
from our method (where ε was chosen to give approximately 100 prototypes). The images 
in the bottom panel are sharper and show greater variety since each is a single handwritten 
image. 

Fig. 6. First 88 prototypes from greedy algorithm. Above each is the number of training 
images first correctly covered by the addition of this prototype (in parentheses is the number 
of miscovered training points by this prototype). 

Fig. 7. The first 88 prototypes (out of 3,372) of the greedy solution. We perform MDS (R 
function sammon) on the tangent distances to visualize the prototypes in two dimensions. 
The size of each prototype is proportional to the log of the number of correct-class training 
images covered by this prototype. 

The left-hand panel of Figure 9 shows the improvement in the objective, 
Δξ − Δη, after each step of the greedy algorithm, revealing an interesting 
feature of the solution: we find that after the first 458 prototypes are 
added, each remaining prototype covers only one training point. Since in 
this example we took Z = X (and since a point always covers itself), this 
means that the final 2,914 prototypes were chosen to cover only themselves. 
In this sense, we see that our method provides a sort of compromise between 
a sparse nearest-prototype classifier and 1-NN. This compromise is 
determined by the prototype-cost parameter λ. If λ > 1, the algorithm does 
not enter the 1-NN regime. The right-hand panel shows that the test error 
continues to improve as λ decreases. 

Fig. 8. Complete-linkage hierarchical clustering of the training images (using R package 
gclus to order the leaves). We display the prototype digits where they appear in the tree. 
Differing vertical placement of the images is simply to prevent overlap and has no meaning. 

Fig. 9. Progress of greedy on each iteration. 
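
The behavior described for the greedy algorithm can be reproduced on a toy problem with the Python sketch below. It is a minimal sketch under stated assumptions, not the authors' implementation: we assume Z = X, open balls of radius ε, and a per-step gain equal to Δξ − Δη − λ (newly covered same-class points, minus other-class points inside the ball, minus the prototype cost), with the loop stopping once no gain is positive. The function name greedy_prototypes and the toy data are hypothetical, and λ is assumed to be positive.

```python
def greedy_prototypes(dist, labels, eps, lam):
    """Greedy sketch with Z = X. dist[j][i] is the dissimilarity between
    candidate j and training point i. Returns a list of tuples
    (candidate index, class, delta_xi, delta_eta) in order of selection."""
    n = len(labels)
    classes = sorted(set(labels))
    # Open ball of radius eps around each candidate.
    balls = [{i for i in range(n) if dist[j][i] < eps} for j in range(n)]
    covered = {c: set() for c in classes}  # points covered by class-c prototypes
    chosen = []
    while True:
        best = None
        for j in range(n):
            for c in classes:
                same = {i for i in balls[j] if labels[i] == c}
                delta_xi = len(same - covered[c])       # newly covered, right class
                delta_eta = len(balls[j]) - len(same)   # wrong-class points in ball
                gain = delta_xi - delta_eta - lam
                if best is None or gain > best[0]:
                    best = (gain, j, c, delta_xi, delta_eta, same)
        if best is None or best[0] <= 0:
            return chosen
        _, j, c, delta_xi, delta_eta, same = best
        covered[c] |= same
        chosen.append((j, c, delta_xi, delta_eta))


# Hypothetical 1-D data: a cluster of "a", a cluster of "b", and an isolated "a".
pts = [0.0, 1.0, 2.0, 10.0, 11.0, 20.0]
labels = ["a", "a", "a", "b", "b", "a"]
dist = [[abs(p - q) for q in pts] for p in pts]
small_cost = greedy_prototypes(dist, labels, eps=1.5, lam=0.1)
large_cost = greedy_prototypes(dist, labels, eps=1.5, lam=1.5)
print(small_cost)
print(large_cost)
```

With the small cost, the isolated point is eventually added as a prototype that covers only itself; with a cost above 1, such a self-covering prototype is never worth adding, which matches the remark that the algorithm then stays out of the 1-NN regime.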

6.3. Protein classification with string kernels. We next present a case 
in which the training samples are not naturally represented as vectors in 
$\mathbb{R}^p$. Leslie et al. (2004) study the problem of classification of proteins based 
on their amino acid sequences. They introduce a measure of similarity between 
protein sequences called the mismatch kernel. The general idea is 
that two sequences should be considered similar if they have a large number 
of short sequences in common (where two short sequences are considered 
the same if they have no more than a specified number of mismatches). 
We take as input a 1,708 × 1,708 matrix with $K_{ij}$ containing the value of 
the normalized mismatch kernel evaluated between proteins i and j [the 
data and software are from Leslie et al. (2004)]. The proteins fall into two 
classes, "Positive" and "Negative," according to whether they belong to a 
certain protein family. We compute pairwise distances from this kernel via 
$D_{ij} = \sqrt{K_{ii} + K_{jj} - 2K_{ij}}$ and then run our method and K-medoids. The left 
panel of Figure 10 shows the 10-fold cross-validated errors for our method 
and K-medoids. For our method, we take a range of equally spaced quantiles 
of the pairwise distances from the minimum to the median for the parameter 
ε. For K-medoids, we take as parameter the fraction of proteins in each 
class that should be prototypes. This choice of parameter allows the classes 
to have different numbers of prototypes, which is important in this example 
because the classes are greatly unbalanced (only 45 of the 1,708 proteins 
are in class "Positive"). The right panel of Figure 10 shows a complete-linkage 
hierarchical clustering of the 45 samples in the "Negative" class with 
the selected prototypes indicated. Samples joined below the dotted line are 
within ε of each other. Thus, performing regular set cover would result in 
every branch that is cut at this height having at least one prototype sample 
selected. By contrast, our method leaves some branches without prototypes. 
In parentheses, we display the number of samples from the "Positive" class 
that are within ε of each "Negative" sample. We see that the branches that 
do not have prototypes are those for which every "Negative" sample has too 
many "Positive" samples within ε to make it a worthwhile addition to the 
prototype set. 

Fig. 10. Proteins data set. Left: CV error (recall that the rightmost point on our method's 
curve corresponds to 1-NN). Right: a complete-linkage hierarchical clustering of the 
negative samples. Each selected prototype is marked. The dashed line is a cut at height ε. 
Thus, samples that are merged below this line are within ε of each other. The number of 
"positive" samples within ε of each negative sample, if nonzero, is shown in parentheses. 
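
The conversion from a kernel matrix to pairwise distances used in Section 6.3 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name kernel_to_distance and the toy linear kernel are hypothetical, and the clamp at zero is our own guard against round-off. If "normalized" is taken to mean that every diagonal entry equals 1 (an assumption on our part), the formula reduces to the square root of 2 − 2K_ij.

```python
import math


def kernel_to_distance(K):
    """Turn a symmetric positive semidefinite kernel matrix K into distances
    D[i][j] = sqrt(K[i][i] + K[j][j] - 2 * K[i][j])."""
    n = len(K)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            squared = K[i][i] + K[j][j] - 2.0 * K[i][j]
            D[i][j] = math.sqrt(max(squared, 0.0))  # guard against round-off
    return D


# Sanity check with a linear kernel K[i][j] = <x_i, x_j>, for which the
# induced distance is the ordinary Euclidean distance.
X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
K = [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]
D = kernel_to_distance(K)
print(D)
```

The resulting matrix D can be passed directly to any method that accepts a dissimilarity matrix, which is all that the prototype selection procedure requires.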

The minimum CV-error (1.76%) is attained by our method using about 
870 prototypes (averaged over the 10 models fit for that value of ε). This 
error is identical to the minimum CV-error of a support vector machine 
(tuning the cost parameter) trained using this kernel. Fitting a model to the 
whole data set with the selected value of ε, our method chooses 26 prototypes 
(of 45) for class "Positive" and 907 (of 1,663) for class "Negative." 

6.4. UCI data sets. Finally, we run our method on six data sets from 
the UCI Machine Learning Repository [Asuncion and Newman (2007)] and 
compare its performance to that of 1-NN (i.e., retaining all training points 
as prototypes), K-medoids and LVQ. We randomly select 2/3 of each data 
set for training and use the remainder as a test set. Ten-fold cross-validation 
[and the "1 standard error rule," Hastie, Tibshirani and Friedman (2009)] 
is performed on the training data to select a value for each method's tuning 
parameter (except for 1-NN). Table 2 reports the error on the test set and 
the number of prototypes selected for each method. For methods taking 
a dissimilarity matrix as input, we use both ℓ2 and ℓ1 distance measures. 
We see that in most cases our method is able to do as well as or better than 
1-NN but with a significant reduction in prototypes. No single method does 
best on all of the data sets. The difference in results observed for using ℓ1 
versus ℓ2 distances reminds us that the choice of dissimilarity is an important 
aspect of any problem. 

Table 2 
10-fold CV (with the 1 SE rule) on the training set to tune the parameters (our method 
labeled "PS") 

Data                             1-NN/ℓ2  1-NN/ℓ1  PS/ℓ2  PS/ℓ1  K-med./ℓ2  K-med./ℓ1   LVQ 
Diabetes         Test Err (%)       28.9     31.6   24.2   26.6       32.0       34.4  25.0 
(p = 8, L = 2)   # Protos            512      512     12      5        194         60    29 
Glass            Test Err (%)       38.0     32.4   36.6   47.9       39.4       38.0  35.2 
(p = 9, L = 6)   # Protos            143      143     34     17         12         24    17 
Heart            Test Err (%)       21.1     23.3   21.1   13.3       22.2       24.4  15.6 
(p = 13, L = 2)  # Protos            180      180      6      4         20         20    12 
Liver            Test Err (%)       41.7     41.7   41.7   32.2       46.1       48.7  33.9 
(p = 6, L = 2)   # Protos            230      230     16     13        120         52   110 
Vowel            Test Err (%)        2.8      1.7    2.8    1.7        2.8        4.0  24.4 
(p = 10, L = 11) # Protos            352      352    352    352        198        165   138 
Wine             Test Err (%)        3.4      3.4   11.9    6.8        6.8        1.7   3.4 
(p = 13, L = 3)  # Protos            119      119      4      3         12         39     3 
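
The tuning step of Section 6.4 relies on the "1 standard error rule." A minimal Python sketch of one common reading of that rule follows: among the candidate parameter values, pick the simplest one whose mean cross-validated error is within one standard error of the smallest mean error. The function name one_se_rule, the convention that candidates are ordered from simplest to most complex, and the toy fold errors are assumptions made for illustration.

```python
import math


def one_se_rule(cv_errors):
    """cv_errors[k] holds the per-fold errors for the k-th candidate value,
    with candidates ordered from simplest (fewest prototypes) to most complex.
    Returns the index of the simplest candidate whose mean error is within
    one standard error of the best mean error."""
    means, ses = [], []
    for folds in cv_errors:
        m = sum(folds) / len(folds)
        var = sum((e - m) ** 2 for e in folds) / (len(folds) - 1)
        means.append(m)
        ses.append(math.sqrt(var / len(folds)))
    best = min(range(len(means)), key=means.__getitem__)
    threshold = means[best] + ses[best]
    for k, m in enumerate(means):
        if m <= threshold:
            return k


# Hypothetical fold errors for three candidate values (3 folds each).
cv_errors = [
    [0.30, 0.34, 0.32],  # simplest model, clearly worse
    [0.21, 0.24, 0.22],  # slightly worse than the best, but within one SE
    [0.20, 0.24, 0.19],  # lowest mean error
]
selected = one_se_rule(cv_errors)
print(selected)
```

In this toy example the rule prefers the middle candidate over the one with the lowest mean error, trading a small amount of estimated accuracy for a sparser prototype set.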

7. Discussion. We have presented a straightforward procedure for selecting 
prototypical samples from a data set, thus providing a simple way to 
"summarize" a data set. We began by explicitly laying out our notion of a 
desirable prototype set, then cast this intuition as a set cover problem, which 
led us to two standard approximation algorithms. The digits data example 
highlights several strengths. Our method automatically chooses a suitable 
number of prototypes for each class. It is flexible in that it can be used in 
conjunction with a problem-specific dissimilarity, which in this case helps 
our method attain a competitive test error for a wide range of values of the 
tuning parameter. However, the main motivation for using this method is 
interpretability: each prototype is an element of X (i.e., is an actual 
hand-drawn image). In medical applications, this would mean that prototypes 
correspond to actual patients, genes, etc. This feature should be useful to 
domain experts for making sense of large data sets. Software for our method 
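will be made available as an R package in the R library. 

APPENDIX: INTEGER PROGRAM (3)'S RELATION TO 
PRIZE-COLLECTING SET COVER 

Claim. Solving the integer program of (3) is equivalent to solving L 
prize-collecting set cover problems. 

Proof. Observing that the constraints (3b) are always tight, we can 
eliminate $\eta_1, \ldots, \eta_n$ in (3), yielding 

\[
\begin{aligned}
\underset{\alpha_j^{(l)},\,\xi_i}{\text{minimize}}\quad
  & \sum_{i} \xi_i
    + \sum_{i} \sum_{\substack{j : x_i \in B(z_j) \\ l \neq y_i}} \alpha_j^{(l)}
    + \lambda \sum_{j,l} \alpha_j^{(l)} \\
\text{s.t.}\quad
  & \sum_{j : x_i \in B(z_j)} \alpha_j^{(y_i)} \ge 1 - \xi_i
    \quad \forall x_i \in \mathcal{X}, \\
  & \alpha_j^{(l)} \in \{0,1\} \quad \forall j, l,
    \qquad \xi_i \ge 0 \quad \forall i.
\end{aligned}
\]

Rewriting the second term of the objective as 

\[
\begin{aligned}
\sum_{i=1}^{n} \sum_{\substack{j : x_i \in B(z_j) \\ l \neq y_i}} \alpha_j^{(l)}
  &= \sum_{j,l} \alpha_j^{(l)} \sum_{i=1}^{n}
     1\{x_i \in B(z_j),\ x_i \notin \mathcal{X}_l\} \\
  &= \sum_{j,l} \alpha_j^{(l)}\,
     |B(z_j) \cap (\mathcal{X} \setminus \mathcal{X}_l)|
\end{aligned}
\]

and letting $C_l(j) = \lambda + |B(z_j) \cap (\mathcal{X} \setminus \mathcal{X}_l)|$ gives 

\[
\begin{aligned}
\underset{\alpha_j^{(l)},\,\xi_i}{\text{minimize}}\quad
  & \sum_{l=1}^{L} \Bigl[ \sum_{x_i \in \mathcal{X}_l} \xi_i
    + \sum_{j=1}^{m} C_l(j)\, \alpha_j^{(l)} \Bigr] \\
\text{s.t. for each class } l:\quad
  & \sum_{j : x_i \in B(z_j)} \alpha_j^{(l)} \ge 1 - \xi_i
    \quad \forall x_i \in \mathcal{X}_l, \\
  & \alpha_j^{(l)} \in \{0,1\} \quad \forall j,
    \qquad \xi_i \ge 0 \quad \forall i : x_i \in \mathcal{X}_l.
\end{aligned}
\]

This is separable with respect to class and thus equivalent to L separate 
integer programs. The lth integer program has variables 
$\alpha_1^{(l)}, \ldots, \alpha_m^{(l)}$ and $\{\xi_i : x_i \in \mathcal{X}_l\}$ 
and is precisely the prize-collecting set cover problem of (4). 

□ 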
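
The separability claim can also be checked numerically on a tiny instance by brute force. The Python sketch below is only a sanity check under our own assumptions: the function name check_separability, the representation of each ball as a set of training indices and the four-point example are hypothetical. It enumerates every 0/1 assignment of the prototype indicators, evaluates the joint objective with the slack variables set to their optimal values, and compares its minimum with the sum of the per-class minima computed with the costs C_l(j).

```python
from itertools import product


def check_separability(balls, labels, lam):
    """balls[j] is the set of training indices inside B(z_j).
    Returns (joint minimum, sum of per-class minima)."""
    n, m = len(labels), len(balls)
    classes = sorted(set(labels))

    def slack(i, alpha_l):
        # Optimal xi_i given the indicators of point i's own class.
        cover = sum(alpha_l[j] for j in range(m) if i in balls[j])
        return max(0, 1 - cover)

    def class_objective(l, alpha_l):
        # Per-class program with cost C_l(j) = lam + (wrong-class points in ball j).
        members = [i for i in range(n) if labels[i] == l]
        cost = sum((lam + sum(1 for i in balls[j] if labels[i] != l)) * alpha_l[j]
                   for j in range(m))
        return sum(slack(i, alpha_l) for i in members) + cost

    def joint_objective(alpha):
        # alpha maps each class to a tuple of m indicators.
        total_xi = sum(slack(i, alpha[labels[i]]) for i in range(n))
        total_eta = sum(alpha[l][j]
                        for i in range(n) for j in range(m) for l in classes
                        if i in balls[j] and l != labels[i])
        total_alpha = sum(sum(a) for a in alpha.values())
        return total_xi + total_eta + lam * total_alpha

    choices = list(product((0, 1), repeat=m))
    joint_min = min(joint_objective(dict(zip(classes, combo)))
                    for combo in product(choices, repeat=len(classes)))
    separate_min = sum(min(class_objective(l, a) for a in choices)
                       for l in classes)
    return joint_min, separate_min


# Four points on a line at 0, 1, 2, 3 with radius 1.5 and Z = X.
balls = [{0, 1}, {0, 1, 2}, {1, 2, 3}, {2, 3}]
labels = ["a", "a", "b", "b"]
joint_min, separate_min = check_separability(balls, labels, lam=0.25)
print(joint_min, separate_min)
```

Exhaustive enumeration is feasible only for very small m and L, so this check illustrates the claim rather than offering a way to solve the program.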



Acknowledgments. We thank Sam Roweis for showing us set cover as 
a clustering method, Sam Roweis, Amin Saberi and Daniela Witten for helpful 
discussions, and Trevor Hastie for providing us with his code for computing 
tangent distance. 






J. BIEN AND R. TIBSHIRANI 



REFERENCES 

Asuncion, A. and Newman, D. J. (2007). UCI Machine Learning Repository. Univ. 
California, Irvine, School of Information and Computer Sciences. 

Basu, M. and Ho, T. K. (2006). Data Complexity in Pattern Recognition. Springer, 
London. 

Cannon, A. H. and Cowen, L. J. (2004). Approximation algorithms for the class cover 
problem. Ann. Math. Artif. Intell. 40 215-223. MR2037478 

Cano, J. R., Herrera, F. and Lozano, M. (2003). Using evolutionary algorithms as 
instance selection for data reduction in KDD: An experimental study. IEEE Transactions 
on Evolutionary Computation 7 561-575. 

Cano, J. R., Herrera, F. and Lozano, M. (2007). Evolutionary stratified training set 
selection for extracting classification rules with trade off precision-interpretability. Data 
and Knowledge Engineering 60 90-108. 

Ceyhan, E., Priebe, C. E. and Marchette, D. J. (2007). A new family of random 
graphs for testing spatial segregation. Canad. J. Statist. 35 27-50. MR2345373 

Cover, T. M. and Hart, P. (1967). Nearest neighbor pattern classification. Proc. IEEE 
Trans. Inform. Theory IT-11 21-27. 

Devijver, P. A. and Kittler, J. (1982). Pattern Recognition: A Statistical Approach. 
Prentice Hall, Englewood Cliffs, NJ. MR0692767 

DeVinney, J. and Wierman, J. C. (2002). A SLLN for a one-dimensional class cover 
problem. Statist. Probab. Lett. 59 425-435. MR1935677 

Fayed, H. A. and Atiya, A. F. (2009). A novel template reduction approach for the 
K-nearest neighbor method. IEEE Transactions on Neural Networks 20 890-896. 

Feige, U. (1998). A threshold of ln n for approximating set cover. J. ACM 45 634-652. 
MR1675095 

Friedman, J. H., Hastie, T. and Tibshirani, R. (2010). Regularization paths for 
generalized linear models via coordinate descent. Journal of Statistical Software 33 1-22. 

Hart, P. (1968). The condensed nearest-neighbor rule. IEEE Trans. Inform. Theory 14 
515-516. 

Hastie, T. and Simard, P. Y. (1998). Models and metrics for handwritten digit 
recognition. Statist. Sci. 13 54-65. 

Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical 
Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. MR2722294 

Hussain, Z., Szedmak, S. and Shawe-Taylor, J. (2004). The linear programming set 
covering machine. Pattern Analysis, Statistical Modelling and Computational Learning. 

Kohonen, T. (2001). Self-Organizing Maps, 3rd ed. Springer Series in Information 
Sciences 30. Springer, Berlin. MR1844512 

Könemann, J., Parekh, O. and Segev, D. (2006). A unified approach to approximating 
partial covering problems. In Algorithms – ESA 2006. Lecture Notes in Computer 
Science 4168 468-479. Springer, Berlin. MR2347166 

Leslie, C. S., Eskin, E., Cohen, A., Weston, J. and Noble, W. S. (2004). Mismatch 
string kernels for discriminative protein classification. Bioinformatics 20 467-476. 

Lozano, M., Sotoca, J. M., Sanchez, J. S., Pla, F., Pękalska, E. and Duin, R. P. W. 
(2006). Experimental study on prototype optimisation algorithms for prototype-based 
classification in vector spaces. Pattern Recognition 39 1827-1838. 

Marchand, M. and Shawe-Taylor, J. (2002). The set covering machine. J. Mach. 
Learn. Res. 3 723-746. 

Marchiori, E. (2010). Class conditional nearest neighbor for large margin instance 
selection. IEEE Trans. Pattern Anal. Mach. Intell. 32 364-370. 

Park, M. Y. and Hastie, T. (2007). L1-regularization path algorithm for generalized 
linear models. J. R. Stat. Soc. Ser. B Stat. Methodol. 69 659-677. MR2370074 

Priebe, C. E., DeVinney, J. G., Marchette, D. J. and Socolinsky, D. A. (2003). 
Classification using class cover catch digraphs. J. Classification 20 3-23. MR1983119 

Ripley, B. D. (2005). Pattern Recognition and Neural Networks. Cambridge Univ. Press, 
New York. 

Simard, P. Y., Le Cun, Y. A. and Denker, J. S. (1993). Efficient pattern recognition 
using a new transformation distance. In Advances in Neural Information Processing 
Systems 50-58. Morgan Kaufmann, San Mateo, CA. 

Takigawa, I., Kudo, M. and Nakamura, A. (2009). Convex sets as prototypes for 
classifying patterns. Eng. Appl. Artif. Intell. 22 101-108. 

Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple 
cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA 99 
6567-6572. 

Tipping, M. E. and Schölkopf, B. (2001). A kernel approach for vector quantization 
with guaranteed distortion bounds. In Artificial Intelligence and Statistics (T. Jaakkola 
and T. Richardson, eds.) 129-134. Morgan Kaufmann, San Francisco. 

Vazirani, V. V. (2001). Approximation Algorithms. Springer, Berlin. MR1851303 

Wilson, D. R. and Martinez, T. R. (2000). Reduction techniques for instance-based 
learning algorithms. Machine Learning 38 257-286. 

Zhu, J., Rosset, S., Hastie, T. and Tibshirani, R. (2004). 1-norm support vector 
machines. In Advances in Neural Information Processing Systems 16 (S. Thrun, L. Saul 
and B. Schölkopf, eds.). MIT Press, Cambridge, MA. 



Department of Statistics 
Stanford University 
Sequoia Hall 
390 Serra Mall 
Stanford, California 94305 
USA 

E-MAIL: jbien@stanford.edu 



Departments of Health, Research, and Policy 
and Statistics 
Stanford University 
Sequoia Hall 
390 Serra Mall 
Stanford, California 94305 
USA 

E-MAIL: tibs@stanford.edu 



